Abstract

Progress in language and image understanding by machines
has sparked the interest of the research community in more
open-ended, holistic tasks, and refueled an old AI dream of
building intelligent machines. We discuss a few prominent
challenges that characterize such holistic tasks and argue for
“question answering about images” as a particularly appealing
instance of such a holistic task. In particular, we point out
that it is a version of a Turing Test that is likely to be more
robust to over-interpretations and contrast it with tasks
like grounding and generation of descriptions. Finally,
we discuss tools to measure progress in this field.


Introduction 


Progress in machine perception and language understanding
(e.g. (Krizhevsky et al., 2012; Liang et al., 2013))
has inspired researchers to work on holistic tasks
that interlink both modalities together in a complex
chain of perception, representation and inference.
Examples include: grounding (Krishnamurthy and Kollar,
2013), language generation (Karpathy and Fei-Fei, 2014;
Donahue et al., 2014), retrieval (Karpathy et al., 2014;
Malinowski and Fritz, 2014b), and question answering
about images (Malinowski and Fritz, 2014a,c).

Recently, Malinowski and Fritz (2014a) have presented
an approach for question answering about images that
resembles the famous Turing Test (Turing, 1950), while
Malinowski and Fritz (2014c) further discuss some of the
associated challenges and issues. In the following, we
elaborate on data acquisition, contrast this challenge with
other tasks including grounding and language generation, and
highlight properties like robustness to over-interpretation,
which make it hard to cheat on such a test.


Challenges 

Architectures working on a holistic task such as question
answering based on images need to deal with a large gamut of
challenges. In this section, we have distilled a few prominent
ones that require joint reasoning over language and visual
inputs. We also argue that holistic architectures can benefit
from common sense knowledge. Finally, we discuss challenges
in data acquisition and show how the task differs from
other well-known tasks.


Vision and language. Scalability: Vision and language
systems ground any internal representation in an external
world that serves as a common reference point for machines
and humans. The human conceptualization divides
these percepts into different instances, categories as well as
spatio-temporal concepts. Architectures that aim at reproducing
this space of human concepts need to capture the
same diversity and therefore scale up to thousands of concepts.

Concept ambiguity: As the number of categories grows, the
semantic boundaries become fuzzier, and hence ambiguities
and gradual memberships are inherently introduced.
For instance, the difference between ‘night stand’ and ‘cabinet’,
or between ‘armchair’, ‘chair’ and ‘sofa’, can be blurry. Such
ambiguities are challenging in at least two ways. Methods need
to distinguish fine-grained differences between these objects
when appropriate. Objective functions and evaluation metrics
need to penalize the methods gradually for their mistakes,
so that a near miss costs less than an unrelated answer.

Ambiguity in reference resolution: The quality of an answer
depends on how ambiguous and latent notions of reference
frames and intentions are understood (Malinowski and Fritz,
2014a). Depending on the cultural bias and the context, we
may use object-centric, observer-centric or world-centric
frames of reference (Levinson, 2003). Moreover, there is no
unified notion of what ‘with’, ‘beneath’ or ‘over’ mean.
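
To make the frame-of-reference ambiguity concrete, the following minimal sketch evaluates the relation “the lamp is left of the chair” under an observer-centric and an object-centric frame. The scene, the coordinates and the helper names (`cross`, `is_left_of`) are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from typing import Tuple

Vec = Tuple[float, float]


def cross(d: Vec, v: Vec) -> float:
    """2D cross product; positive when v points to the left of direction d."""
    return d[0] * v[1] - d[1] * v[0]


def is_left_of(target: Vec, anchor: Vec, facing: Vec) -> bool:
    """True if `target` lies to the left of `anchor` as seen along `facing`."""
    offset = (target[0] - anchor[0], target[1] - anchor[1])
    return cross(facing, offset) > 0


# Hypothetical top-down scene: x grows to the east, y grows to the north.
observer_view = (0.0, 1.0)    # the observer looks north
chair_pos = (0.0, 5.0)
chair_facing = (0.0, -1.0)    # the chair faces south, toward the observer
lamp_pos = (-1.0, 5.0)        # the lamp stands west of the chair

# Observer-centric frame: use the observer's viewing direction.
observer_centric = is_left_of(lamp_pos, chair_pos, observer_view)
# Object-centric frame: use the chair's own facing direction.
object_centric = is_left_of(lamp_pos, chair_pos, chair_facing)

print(observer_centric, object_centric)
```

In this scene the two frames disagree: the lamp is to the left of the chair from the observer's point of view, but to the right of the chair from the chair's own point of view, so the same question can have two defensible answers.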

Common sense knowledge. Interestingly, some questions
can be answered quite reliably with access to common sense
knowledge. For instance, “Which object on the table is used
for cutting?” already narrows down the likely options
significantly. Such an example suggests that question-answering
architectures would benefit significantly from common sense
knowledge.

An ‘object for cutting’ is not a directly visual concept but
concerns the affordance of the object, and is therefore
challenging to acquire from images only. On the other hand,
co-occurrences in visual data can represent a kind of visual
common sense knowledge of very mundane facts or probabilistic
relations that are rarely found in common sense
knowledge bases.

Annotations. We argue that despite the aforementioned
challenges, “question answering about images” has unique
advantages over other tasks in terms of data acquisition
and task evaluation. In contrast to grounding
(Krishnamurthy and Kollar, 2013), annotating images with
question and answer pairs does not require detailed annotations
of whole scenes in terms of predicates representing objects
and their relations. The task is also agnostic to the internal
representation of a method. In contrast to language generation
(Karpathy and Fei-Fei, 2014; Donahue et al., 2014),
the output space of a question answering task is more
restricted and hence evaluation of different architectures on
the task is easier to formulate. In contrast to typical computer
vision tasks like object detection (Everingham et al.,
2010), architectures are judged solely on right answers, not
on an internal representation. In contrast to the traditional
Turing Test (Turing, 1950), “answering questions about images”
is less prone to over-interpretation, in which the human
interrogator associates a meaning with machine answers. Hence,
a method can be forced to answer to the point rather than
“cheating” by giving generic answers or output that is open
to interpretation.

Evaluation of architectures 

Measuring progress on holistic tasks requires identifying their
goals. For instance, a suitable metric for “question answering
about images” should evaluate architectures based on
produced answers but not on intermediate results such as
detections or logical forms. For a Visual Turing Challenge,
we seek a metric that satisfies several properties. The most
important are:

Automation: Evaluating answers on a task as complex as
answering questions requires a deep understanding
of natural language, the involved concepts and the hidden
intentions of the questioner. The ideal but impractical metric
would be to manually judge every single answer of every
architecture individually. Therefore, we are seeking an automatic
approximation so that we can evaluate different holistic
architectures at scale. Malinowski and Fritz (2014a) proposed
to restrict the answer space in order to achieve this
goal, while leaving the questions unconstrained.
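
As an illustration of why a restricted answer space makes automatic evaluation straightforward, the sketch below scores free-form predictions by exact match against a closed answer vocabulary. It is a minimal example under our own assumptions: the function names, the normalization step and the toy vocabulary are hypothetical and are not the evaluation code of the cited work.

```python
from typing import List, Set


def normalize(answer: str) -> str:
    """Lower-case the answer and collapse whitespace."""
    return " ".join(answer.lower().split())


def accuracy(predictions: List[str],
             ground_truths: List[str],
             answer_space: Set[str]) -> float:
    """Fraction of questions answered with the exact ground-truth answer.

    Predictions outside the restricted answer space count as wrong,
    so the score never requires a human to interpret an answer.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths differ in length")
    if not predictions:
        raise ValueError("cannot evaluate an empty set of questions")
    correct = 0
    for prediction, truth in zip(predictions, ground_truths):
        pred_n, truth_n = normalize(prediction), normalize(truth)
        if pred_n in answer_space and pred_n == truth_n:
            correct += 1
    return correct / len(predictions)


# Hypothetical closed vocabulary and three question-answer pairs.
answer_space = {"chair", "sofa", "2", "3"}
predictions = ["Chair", "a large sofa maybe", "2"]
ground_truths = ["chair", "sofa", "3"]
score = accuracy(predictions, ground_truths, answer_space)
print(score)
```

Only the first prediction counts as correct here: the second falls outside the answer space and the third names the wrong number. Exact matching is deliberately strict, which is one motivation for the softer, similarity-based scores discussed next.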

Social consensus: The complex tasks that we are interested
in are inherently ambiguous. The ambiguities stem from
many factors such as cultural bias, different frames of
reference and fine-grained categorization. This implies that
multiple interpretations of a question are possible. To deal with
different interpretations of words, Malinowski and Fritz
(2014a) define WUPS scores using lexical databases
(Miller, 1995) with Wu-Palmer similarity (Wu and Palmer,
1994). To deal with different interpretations of a question,
Malinowski and Fritz (2014c) suggest that the quality of
answers should be measured according to the social consensus,
where the answers are evaluated against multiple ground
truths. Interestingly, such a metric also naturally quantifies
the social agreement on the answer, and serves as a practical
approximation of tedious manual evaluation.
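
The two ideas above can be sketched in a few lines. The example below computes the Wu-Palmer similarity, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest common ancestor of the two concepts, and then scores a predicted answer against several human answers. It rests on several assumptions that are ours, not the cited papers': a tiny hand-written taxonomy (`PARENT`) stands in for a real lexical database, answers are single words, and the two aggregation modes (`"average"` and `"max"`) are illustrative choices rather than the exact published definitions.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has depth 1.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "furniture": "entity",
    "seat": "furniture",
    "chair": "seat",
    "armchair": "chair",
    "sofa": "seat",
    "cabinet": "furniture",
    "tool": "entity",
    "knife": "tool",
}


def path_to_root(concept: str) -> List[str]:
    """Concepts from `concept` up to the root; raises KeyError if unknown."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def wup(a: str, b: str) -> float:
    """Wu-Palmer similarity of two concepts in the toy taxonomy."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # First node on a's upward path that is also an ancestor of b.
    lcs = next(node for node in path_a if node in ancestors_b)
    depth_lcs = len(path_to_root(lcs))
    return 2.0 * depth_lcs / (len(path_a) + len(path_b))


def consensus(predictions: List[str],
              human_answers: List[List[str]],
              mode: str = "average") -> float:
    """Mean per-question similarity to several human answers.

    mode="average": agree with the human answers on average.
    mode="max": agree with at least one human answer.
    """
    if mode not in ("average", "max"):
        raise ValueError("mode must be 'average' or 'max'")
    if not predictions or len(predictions) != len(human_answers):
        raise ValueError("need one non-empty answer list per prediction")
    scores = []
    for prediction, truths in zip(predictions, human_answers):
        sims = [wup(prediction, truth) for truth in truths]
        scores.append(max(sims) if mode == "max" else sum(sims) / len(sims))
    return sum(scores) / len(scores)


print(wup("chair", "sofa"))   # 0.75: the common ancestor 'seat' is deep
print(wup("chair", "knife"))  # 2/7: only the root is shared
print(consensus(["chair"], [["chair", "sofa"]], mode="average"))  # 0.875
print(consensus(["chair"], [["chair", "sofa"]], mode="max"))      # 1.0
```

Confusing ‘chair’ with ‘sofa’ is penalized far less than confusing it with ‘knife’, which matches the requirement that metrics penalize mistakes gradually. With two annotators who answered ‘chair’ and ‘sofa’, the averaged score also reflects how much the humans themselves agree.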

Experimental scenarios. In many cases, success on challenging
learning problems has been accelerated by the use of
external data in training. We believe that a Visual Turing
Challenge should include a sub-task in which the use of
auxiliary data is prohibited, in order to understand how holistic
learners generalize from limited and challenging data in a more
established setup. On the other hand, we should not limit
ourselves to such artificial restrictions in building the next
generation of holistic learners. Therefore, open sub-tasks
that permit the use of additional sources in training
have to be stated, including additional vision and language
resources, synthetic data and curated data.

Summary 

The goal of this contribution is to spark discussion
about challenges and benchmarking architectures on holistic
tasks. We also argue that “question answering about
images” is a holistic task that offers multiple advantages
over related tasks. For example, it is likely to be less
prone to “cheating” by over-interpretation than a traditional
Turing Test, the annotation process is tractable by
crowdsourcing question and answer pairs, and the task
does not artificially force any internal representation on
the methods. Our most recent efforts and results on
establishing a Visual Turing Test can be found on our website:
www.d2.mpi-inf.mpg.de/visual-turing-challenge 